Flights are a frequently used mode of transportation in the US because of the sheer size of the country. If you have ever traveled across the country, you may have worried about a flight delay when you had an important meeting or business engagement, or needed to catch a connecting flight. It is therefore reasonable to investigate flights' on-time performance by time frame, carrier, and delay cause. The following exploratory questions are answered throughout this report:
Since the dataset is huge (30,302,072 records in total), it is better to import it as a database, generated and filtered using Python, and then extract the necessary columns for each question using SQL. While cleaning the dataset in Python, a record is kept only if it has at least one delay cause and its delay minutes are greater than or equal to zero; this filter is applied to the overall dataset. The cleaned dataset (8,814,394 records in total) and its summary statistics are shown below.
Import the data from the DB into R.
The dep_delay_group column, included in the original dataset, assigns a category for every 15 minutes of dep_delay.
The distance_group column, also included in the original dataset, assigns a category for every 250 miles of distance.
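Judging from the records printed below (delay 214 falls in group 12, distance 2475 in group 10, distance 3784 in group 11), the bins appear to follow the BTS convention: 15-minute delay bins capped at group 12, and 250-mile distance bins capped at group 11. A minimal sketch of that binning, with hypothetical function names:

```python
# Hypothetical sketch of how the two grouping columns can be derived,
# mirroring the binning visible in the data below.
def dep_delay_group(dep_delay: int) -> int:
    """15-minute delay bins; delays of 180+ minutes all fall in group 12."""
    return min(dep_delay // 15, 12)

def distance_group(distance: int) -> int:
    """250-mile distance bins (group 1 = under 250 miles); 2500+ miles is group 11."""
    return min(distance // 250 + 1, 11)

# Values taken from the first records shown below:
print(dep_delay_group(10), dep_delay_group(214))   # 0 12
print(distance_group(2475), distance_group(3784))  # 10 11
```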
## obs_num year quarter carrier dep_delay dep_delay_group distance
## 1 2 2011 3 AA 10 0 2475
## 2 3 2011 3 AA 52 3 2475
## 3 5 2011 3 AA 79 5 2475
## 4 7 2011 3 AA 214 12 2475
## 5 8 2011 3 AA 42 2 2475
## 6 9 2011 3 AA 87 5 2475
## 7 10 2011 3 AA 383 12 2475
## 8 11 2011 3 AA 58 3 2475
## 9 14 2011 3 AA 304 12 3784
## 10 15 2011 3 AA 75 5 3784
## 11 16 2011 3 AA 8 0 3784
## 12 17 2011 3 AA 95 6 3784
## 13 18 2011 3 AA 30 2 3784
## 14 22 2011 3 AA 308 12 3711
## 15 25 2011 3 AA 10 0 3711
## 16 26 2011 3 AA 79 5 3711
## 17 27 2011 3 AA 74 4 3711
## 18 28 2011 3 AA 45 3 3711
## 19 29 2011 3 AA 244 12 3711
## 20 30 2011 3 AA 51 3 3711
## distance_group delay_reason
## 1 10 1
## 2 10 1
## 3 10 1
## 4 10 1
## 5 10 1
## 6 10 1
## 7 10 1
## 8 10 1
## 9 11 1
## 10 11 1
## 11 11 1
## 12 11 1
## 13 11 1
## 14 11 1
## 15 11 1
## 16 11 1
## 17 11 1
## 18 11 1
## 19 11 1
## 20 11 1
## year quarter carrier dep_delay
## 2011: 413456 1:2061974 WN :2132084 Min. : 0.00
## 2012:1563476 2:2305939 EV :1059558 1st Qu.: 21.00
## 2013:2019226 3:2429545 AA : 973360 Median : 39.00
## 2014:2023269 4:2016936 DL : 828152 Mean : 58.05
## 2015:1664532 OO : 798608 3rd Qu.: 73.00
## 2016:1130435 UA : 796603 Max. :2402.00
## (Other):2226029
## dep_delay_group distance distance_group delay_reason
## 1 :2010312 Min. : 24.0 2 :2203340 1:2899782
## 2 :1607225 1st Qu.: 368.0 3 :1751157 2: 304832
## 0 :1327665 Median : 641.0 4 :1361047 3:2569572
## 3 :1043068 Mean : 797.1 1 :1124083 4: 18174
## 4 : 700937 3rd Qu.:1031.0 5 : 930797 5:3022034
## 5 : 496332 Max. :4983.0 7 : 399416
## (Other):1628855 (Other):1044554
3.(a) How did you manipulate the data to prepare it for analysis?
General manipulation is done via Python, and this data is extracted from the DB using SQL. The first 10 records are shown below.
obs_num is a unique observation number; it is needed because a single record can have multiple reasons for the delay.
## obs_num delay_reason carrier
## 3204615 1 3 AA
## 1 2 1 AA
## 3204616 2 3 AA
## 2 3 1 AA
## 3204617 4 3 AA
## 3 5 1 AA
## 3204618 5 3 AA
## 3204619 6 3 AA
## 4 7 1 AA
## 3204620 7 3 AA
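The extraction step can be sketched in Python with the standard `sqlite3` module. This is a self-contained illustration with a hypothetical `flights` table and made-up sample rows; note how a flight delayed for several reasons contributes one row per (obs_num, delay_reason) pair, as in the output above.

```python
import sqlite3

# Build a tiny in-memory stand-in for the real DB (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (obs_num INTEGER, delay_reason INTEGER, carrier TEXT)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [(2, 1, "AA"), (2, 3, "AA"), (3, 1, "AA"), (3, 3, "AA"), (5, 1, "AA")],
)

# Each (obs_num, delay_reason) pair is its own row.
rows = conn.execute(
    "SELECT obs_num, delay_reason, carrier FROM flights ORDER BY obs_num, delay_reason"
).fetchall()
print(rows[:3])  # [(2, 1, 'AA'), (2, 3, 'AA'), (3, 1, 'AA')]
```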
3.(b) How did you handle missing, incomplete, or noisy data?
Missing or incomplete data are removed in Python.
A record is kept when it has at least one cause of delay and its delay minutes are greater than or equal to zero.
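That keep-rule can be sketched with pandas; the column names here are assumptions standing in for the five BTS delay-cause columns and the departure-delay column.

```python
import pandas as pd

# Hypothetical column names for the five delay causes.
cause_cols = ["carrier_delay", "weather_delay", "nas_delay",
              "security_delay", "late_aircraft_delay"]

raw = pd.DataFrame({
    "dep_delay": [10, -5, 52, 30],
    "carrier_delay": [10, 0, 0, 0],
    "weather_delay": [0, 0, 52, 0],
    "nas_delay": [0, 0, 0, 0],
    "security_delay": [0, 0, 0, 0],
    "late_aircraft_delay": [0, 5, 0, 0],
})

has_cause = raw[cause_cols].gt(0).any(axis=1)   # at least one delay cause
non_negative = raw["dep_delay"] >= 0            # delay minutes >= 0
clean = raw[has_cause & non_negative]
print(len(clean))  # 2 (an early departure and a no-cause row are dropped)
```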
3.(c) How did you perform data analysis in code?
3.(d) What challenges did you encounter and how did you solve them?
4.(a) Analysis result
4.(b) Additional visualization
3.(a) How did you manipulate the data to prepare it for analysis?
## obs_num year quarter carrier dep_delay_group
## 1 7 2011 3 AA 12
## 2 10 2011 3 AA 12
## 3 14 2011 3 AA 12
## 4 17 2011 3 AA 6
## 5 22 2011 3 AA 12
## 6 29 2011 3 AA 12
## 7 37 2011 3 AA 8
## 8 44 2011 3 AA 7
## 9 47 2011 3 AA 7
## 10 55 2011 3 AA 6
3.(b) How did you handle missing, incomplete, or noisy data?
Using SQL, duplicate rows across those five columns are omitted while extracting them from the DB by adding the DISTINCT keyword.
In order to compare over the same period (one full year each), the plot is drawn with only the years 2012-2016.
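Both steps can be combined in one query. This sketch reuses the hypothetical in-memory `flights` table idea: DISTINCT collapses repeated rows over the five selected columns, and the WHERE clause keeps only 2012-2016.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (obs_num INT, year INT, quarter INT, carrier TEXT, dep_delay_group INT)"
)
conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?, ?)", [
    (10, 2012, 1, "AA", 12),
    (10, 2012, 1, "AA", 12),  # duplicate row, removed by DISTINCT
    (7, 2011, 3, "AA", 12),   # outside 2012-2016, removed by WHERE
])

rows = conn.execute("""
    SELECT DISTINCT obs_num, year, quarter, carrier, dep_delay_group
    FROM flights
    WHERE year BETWEEN 2012 AND 2016
""").fetchall()
print(rows)  # [(10, 2012, 1, 'AA', 12)]
```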
3.(c) How did you perform data analysis in code?
Grouping by year and carrier, count the number of cases where the delay time is greater than or equal to 90 minutes across 2012-2016, using ddply.
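A Python equivalent of that ddply step, on made-up sample rows: a delay of at least 90 minutes corresponds to dep_delay_group >= 6 (six 15-minute bins), matching the `delgp6_count` column shown below.

```python
from collections import Counter

# (year, carrier, dep_delay_group) -- hypothetical sample records.
records = [
    (2012, "AA", 12), (2012, "AA", 3), (2013, "AA", 7),
    (2013, "9E", 6), (2013, "9E", 2),
]

# Count per (year, carrier) the flights delayed 90+ minutes.
delgp6_count = Counter(
    (year, carrier) for year, carrier, grp in records if grp >= 6
)
print(delgp6_count)  # one qualifying flight per (year, carrier) pair here
```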
Since the carrier is coded as an abbreviation, I created a carrier table consisting of each abbreviation and its corresponding full name from a look-up table downloaded from the same web source. A regular expression is used to extract the essential part of the name. After filtering and matching, the data looks like the below.
## carrier year delgp6_count FullName
## 1 9E 2013 10123 Endeavor Air Inc.
## 2 9E 2013 10123 Pinnacle Airlines Inc.
## 4 AA 2015 22487 American Airlines Inc.
## 5 AA 2013 18212 American Airlines Inc.
## 6 AA 2012 16905 American Airlines Inc.
## 7 AA 2014 18955 American Airlines Inc.
## 8 AA 2016 24901 American Airlines Inc.
## 10 AS 2013 2003 Alaska Airlines Inc.
## 11 AS 2012 2509 Alaska Airlines Inc.
## 12 AS 2014 2409 Alaska Airlines Inc.
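The regular-expression step might look like the following sketch; the entry format is an assumption modeled on carrier-history look-up tables that append a code and year range after the name, which the regex strips off.

```python
import re

# Hypothetical look-up entries: "Full Name: CODE (years)".
entries = [
    "American Airlines Inc.: AA (1960 - )",
    "Pinnacle Airlines Inc.: 9E (2002 - 2012)",
]

# Drop everything from the first colon onward, keeping the full name.
names = [re.sub(r"\s*:.*$", "", e) for e in entries]
print(names)  # ['American Airlines Inc.', 'Pinnacle Airlines Inc.']
```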
4.(a) Analysis result
4.(b) Additional visualization
3.(d) What challenges did you encounter and how did you solve them?
3.(a) How did you manipulate the data to prepare it for analysis?
## obs_num year quarter carrier delay_reason
## 1 2 2011 3 AA 1
## 2 3 2011 3 AA 1
## 3 5 2011 3 AA 1
## 4 7 2011 3 AA 1
## 5 8 2011 3 AA 1
## 6 9 2011 3 AA 1
## 7 10 2011 3 AA 1
## 8 11 2011 3 AA 1
## 9 14 2011 3 AA 1
## 10 15 2011 3 AA 1
3.(b) How did you handle missing, incomplete, or noisy data?
3.(c) How did you perform data analysis in code?
3.(d) What challenges did you encounter and how did you solve them?
- Reference:
4.(a) Analysis result
4.(b) Additional visualization